REPORT DOCUMENTATION PAGE

DTIC accession number: AD A092646
Report number: 16669.4-M
Title: Styles of Data Analysis, and Their Implications for Statistical Computing
Author: J. W. Tukey
Performing organization: Princeton University, Princeton, NJ 08544
Type of report: Reprint
Contract or grant number: DAAG29-79-C-[illegible]
Controlling office: US Army Research Office, PO Box 12211, Research Triangle Park, NC 27709
Number of pages: 11
Security classification: Unclassified
Distribution statement: Submitted for announcement only

Styles  of  data  analysis,  and  their  implications 
for  statistical  computing 

Tukey,  J.W.,  Princeton,  USA  Session  Al/second  paper 


Summary 

Statistical computing almost inevitably implies special programs, systems, or
languages. We are gradually learning how to describe -- and attain -- good practice
from such points of view as easy use, input compatibility with people, decent
numerical-analysis performance, and even easy maintainability. We must do more of all of this, as
I hope everyone will agree. We must also adapt to the needs of the times. This requires
looking at the latest styles of data analysis and trying to understand their structure from
the user's point of view: Not just exploratory and confirmatory, but the pieces these can
share and the pieces that must be different. Robust techniques, not just alone but in
parallel. Things the computer has yet to learn to do, as well as those it can already do.

Keywords: data analysis, exploratory, confirmatory, diagnostic, middleput, preoutput,
data expansion, autonomic judgment, SDAPs.

Every set of special programs, every system, every language reflects, perhaps
implicitly, an understanding of one or more styles of data analysis. This is unavoidable. This
makes the user happy when the style he wants to use is among those reflected. With
relatively few exceptions -- as we must regretfully expect -- today's tools -- programs,
systems, languages -- reflect yesterday's styles. It is high time for a fashion show, for an
introduction to the styles of the new season.

Robust  Techniques. 

Some of you may think that robust techniques of analysis is the only major new
style. We will see shortly that this need not be so. It is, of course, a very important
class of innovations. Here we shall discuss it only briefly and generally, emphasizing
that

• for the present at least, we expect to provide the results of both a classical and a
robust/resistant analysis.

• iterative calculations can be expected to occur, perhaps in multiple loops, inside
(almost) every robust/resistant analysis.

• we badly need procedures that find -- and report to the user -- multiple answers.

Not one of these three makes the planning of tools easier, but all three have to be faced.

*Prepared in part in connection with research at Princeton University supported by the U.S. Army Research Office
(Durham).

Wants of users.

Users would like a single answer, without the need to think about it. If we satisfy
this  desire,  our  particular  users  will  function  poorly,  and  our  programs  and  systems  will 
slowly  but  steadily  get  a  (well-deserved)  bad  reputation. 

Learning how to convey alternative answers, caveats and warnings in such a way --
very specifically including in such a format -- as to combine

•  reduced  user  discomfort,  with 

•  increased  user  response 

is one of the main tasks confronting statistical computing. We have tackled the human
interface at input -- at least to a degree -- it is now high time for us to tackle the more
difficult human interface at output. (If doing this well requires the techniques we
ordinarily relegate to "advertising people", such as motivation research, then we will have to
do what is required.)

Data  analysis. 

Quite the opposite of data reduction, data analysis is pretty well characterized by
"making more numbers out of fewer". (Once we say this seriously, the reasonability --
even inevitability -- of parallel analyses of a single set of data becomes clear, since
uniqueness is not a natural consequence of "fewer -> more".) We only complete
reducing to fewer numbers when we have calculated a body of numbers (part of our analysis)
of which we are willing to say, "we have looked, and found no indication of any further
informative structure".

When  dealing  with  a  single  batch  of  numbers,  for  example,  we  can  report  only  a 
location  number  and  a  scale  number  IF  and  ONLY  IF  we  have  calculated  the  residuals 
and  carefully  examined  them  for  any  informative  structure.  This  means  looking,  at  least 

•  at  the  large-scale  structure  of  their  distribution  -  should  there  be  warnings  of 
stretched  (or  squeezed-in)  tails,  of  skewness,  of  bi-  or  multi-  modality? 

•  at  their  granularity  -  are  the  values  actually  reported  coarse-grained  enough  for  this 
to  deserve  notice? 

• (if the values occurred or were observed over time, or along some other linear
variable), is there evidence of any substantial time dependence?

[probably  a  few  more]. 

Avoiding the pitfalls of "data reduction" stresses our programs, our computers, and our
thoughts, yet it is one of the most important things for us to do better and better.
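
The rule above (report a location and a scale only after the residuals have been examined) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `examine_batch`, the choice of median and median absolute deviation, the three crude checks, and every threshold are assumptions made for this example, not procedures given in the paper.

```python
from statistics import median

def examine_batch(values):
    """Return a location, a scale, and any warnings found in the residuals."""
    n = len(values)
    loc = median(values)
    residuals = [v - loc for v in values]
    scale = median(abs(r) for r in residuals)  # median absolute deviation
    warnings = []

    # Large-scale shape: compare upper and lower quarter-spreads about the median.
    ordered = sorted(residuals)
    lower = -ordered[n // 4]
    upper = ordered[(3 * n) // 4]
    if scale > 0 and abs(upper - lower) > scale:   # threshold is an assumption
        warnings.append("skewness")

    # Granularity: few distinct values relative to the batch size.
    if len(set(values)) < n / 2:                   # threshold is an assumption
        warnings.append("coarse granularity")

    # Time dependence: does the second half sit away from the first half?
    half = n // 2
    drift = median(residuals[half:]) - median(residuals[:half])
    if scale > 0 and abs(drift) > scale:           # threshold is an assumption
        warnings.append("time dependence")

    return {"location": loc, "scale": scale, "warnings": warnings}
```

A caller would treat the location and scale as reportable on their own only when the list of warnings comes back empty; a steadily rising batch such as 1, 2, ..., 20 is flagged for time dependence.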
Basic styles.

For many statistical data analysts and in an increasing collection of areas of
application, the distinction between

•  exploratory  data  analysis,  AND 

•  critical  or  confirmatory  data  analysis 

is quite clear and a part of routine thought processes. For others, this may not be so.

Exploratory data analysis is detective work -- numerical, counting, or graphical
detective work -- analysis devoted to finding indications -- the "clues" of data analysis --
of what appears to be going on, of what might be going on. The detective in a classical
detective story is effective when he or she finds many clues, of which some are
misleading. A set of exploratory data analysis tools are good, are useful, when they find many
indications, not all of which we can be sure about, not all of which will be confirmed
when-and-IF we can examine additional data.

Critical data analysis involves the assessment of part of the uncertainty of such
indications -- of that part corresponding to the differences revealed in the data that was
analyzed.  Standard  errors,  tests  of  directionality  (and  occasionally,  I  fear,  even  tests  of 
significance  that  do  not  involve  directionality),  and  confidence  statements  all  use 
revealed  differences  to  assess  that  part  of  the  uncertainty  that  is  calculable  from  the  data. 
ALL  also  require  good  judgment  in  assessing  that  part  of  the  uncertainty  not  likely  to  be 
revealed,  at  least  by  data  limited  in  those  ways  in  which  the  actual  data  is  limited. 

Much data is inevitably submitted to first exploratory -- whether formal or informal
-- and then critical techniques. (Who can analyze the economics of this century free of
the exploratory result that there seemed to be a depression in 1929?) We are all aware
that such overlap has its problems; we need to recognize that we cannot always eliminate
them.

Confirmatory data analysis, as we shall use the term, is critical data analysis on an
unexplored body of data believed to be either

•  parallel  to  some  body  (or  bodies)  whose  exploratory  analysis  (formal  or  informal) 
has  suggested  an  analysis  -  and,  ordinarily,  a  focus  on  certain  constants  produced 
in  that  analysis  --  for  the  data  at  hand,  OR 

•  of  such  a  form  and  character  that  either  theory  (in  a  scientific  or  technological 
field)  or  purpose  (as  often  in  business  or  government)  prescribes  the  analysis,  OR 

•  of  such  a  form  that  some  standard  (really  default)  analysis  is  almost  inevitable,  OR 

•  gathered  in  a  carefully  planned  way  with  this  specific  analysis  in  mind. 

The distinction between confirmatory and merely critical analyses is crucial for the
understanding and practice of data analysis. However, since its penetration into
statistical computing seems likely to be confined to questions of caveats and automatic
warnings, we will not try to discuss it more deeply here.

The tasks of inventors and realizers of statistical computing tools are chiefly directed
to processes -- rather than to ambient philosophy. So let us to our processes.

Processes of EDA.

The  most  helpful,  and  most  important,  subdivision  of  processes  of  exploratory  data 
analysis  divides  them  into 

• autonomic data analysis processes -- ADAPs -- that convert data to analyses, AND

• diagnostic data analysis processes -- DDAPs -- that look at (aspects of) the results of
analysis and endeavor to communicate with the analyst about what can be "seen".

It  will  often  be  WRONG  to  separate  ADAPs  and  DDAPs  in  the  functioning  of  statistical 
computing  tools;  it  will  often  be  essential  to  separate  them  in  thinking  about  what  is  to  be 
done. 

*  further  subdivision  * 

As we will shortly illustrate, ADAPs themselves usually divide into two parts:

• autonomic data expansion processes -- ADEs -- that convert our data into more
numbers (it will be these that our diagnostic processes are likely to need to feed
upon), AND

• optimistic concentrators -- OCONs -- that convert the more numbers into the few
that we might be satisfied with if our DDAPs have found nothing further relevant.

Two reasons why this distinction is important are (1) that we may properly choose to
pair one of several OCONs with a particular ADE, and (2) it may be wise to have an
ADE produce, either actually or potentially, more different things than will be used in
any one situation.

*  a  simple  example,  ADEs  * 

If we start with just a batch of numbers, our ADE can reasonably make a variety of
typical values (median, midmean, biweight-6, and even mean) and a variety of measures
of spread (s, median deviation, pseudovariances, etc.) and a variety of measures of
general distribution (e.g. letter values, which are order-statistic related [Tukey, 1977]).
It can also reasonably make one or more kinds of residuals, and may not want to destroy
the individual values. This is clearly data expansion. We intend such an ADE to make
all the standard things that either OCONs or DDAPs might require. (In special
circumstances, ADEs with even more diversified outputs may be appropriate.)
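
A minimal sketch of such an ADE for a batch, in Python, may make "data expansion" concrete. The function names `ade_batch` and `letter_values`, the dictionary layout, and the simple middle-half definition of the midmean are assumptions made for this illustration; the biweight and pseudovariances mentioned above are deliberately left out.

```python
from statistics import mean, median, stdev

def letter_values(xs, levels=("M", "H", "E")):
    """Letter values of a sorted batch by repeated halving of depth."""
    n = len(xs)
    depth = (n + 1) / 2
    out = {}
    for tag in levels:
        lo = int(depth)
        if depth == lo:
            out[tag] = (xs[lo - 1], xs[-lo])
        else:  # half-integer depth: average the two neighbouring values
            out[tag] = ((xs[lo - 1] + xs[lo]) / 2, (xs[-lo] + xs[-lo - 1]) / 2)
        depth = (int(depth) + 1) / 2
    return out

def ade_batch(values):
    """Expand a batch into a 'middleput' holding many derived numbers."""
    xs = sorted(values)
    n = len(xs)
    q = n // 4
    med = median(xs)
    return {
        "values": list(values),  # the individual values are not destroyed
        "typical": {"median": med, "midmean": mean(xs[q:n - q]), "mean": mean(xs)},
        "spread": {"s": stdev(xs),
                   "median_deviation": median(abs(x - med) for x in xs)},
        "letter_values": letter_values(xs),
        "residuals": [x - med for x in values],
    }
```

The output is larger than the input by design: nine numbers go in, and several typical values, spreads, letter values, and a full set of residuals come out, ready for whichever OCON or DDAP asks for them.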

*  a  simple  example,  OCONs  * 

OCONs  that  might  well  be  paired  with  this  ADE  might  produce,  alternatively, 

1)  a  mean  and  a  sample  standard  deviation, 

2)  a  mean  and  its  standard  error, 

3)  all  three  of  the  above, 

4) a five-, seven- (or more) number summary (Tukey, 1977),

5) a suspended rootogram, either explicit (Kurtz et al 1965, Tukey 1970-71) or
implicit (Tukey 1977, Chapter 17).
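
Several of the OCONs listed above can be sketched as small functions that all read the same middleput. In this Python sketch the middleput layout produced by `make_middleput`, and the names `ocon_mean_sd`, `ocon_mean_se`, `ocon_all_three`, and `ocon_five_number`, are hypothetical; the suspended rootogram is omitted.

```python
from math import sqrt
from statistics import mean, median, stdev

def make_middleput(values):
    """A deliberately small middleput (assumed layout) for the OCONs below."""
    xs = sorted(values)
    return {"n": len(xs), "sorted": xs, "mean": mean(xs), "s": stdev(xs)}

def ocon_mean_sd(mp):
    return {"mean": mp["mean"], "s": mp["s"]}

def ocon_mean_se(mp):
    return {"mean": mp["mean"], "se": mp["s"] / sqrt(mp["n"])}

def ocon_all_three(mp):
    return {**ocon_mean_sd(mp), **ocon_mean_se(mp)}

def _at_depth(xs, d):
    """Values at depth d from each end; half-integer depths are averaged."""
    lo = int(d)
    if d == lo:
        return xs[lo - 1], xs[-lo]
    return (xs[lo - 1] + xs[lo]) / 2, (xs[-lo] + xs[-lo - 1]) / 2

def ocon_five_number(mp):
    """Extremes, hinges, and median."""
    xs, n = mp["sorted"], mp["n"]
    lo_hinge, hi_hinge = _at_depth(xs, (int((n + 1) / 2) + 1) / 2)
    return (xs[0], lo_hinge, median(xs), hi_hinge, xs[-1])
```

Because each OCON only reads the middleput, any of them can be paired with the same ADE, which is the first of the two reasons given earlier for keeping the two kinds of process distinct.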

*  a  simple  example,  DDAPs  * 


DDAPs that we might want to connect to this same ADE might include, alternatively
or together,

1) a plebeian probability plot or some up-to-date improvement,

2) summarized information, as by g and h (Tukey, unpublished; Hoaglin and Peters,
1979), about distribution shape,

3) ordered values of leaps (differences in adjacent order-statistics divided by
differences of corresponding theoretical order-statistic typical values),

4) (if the data was collected in order according to time, space, etc.) plots of residuals
in order of collection, both raw and smoothed,

5)  and  so  forth. 

Why  are  these  things  being  produced?  As  a  guide  to  judgment,  as  a  basis  for  choice. 
What  choice?  The  choice  of  what  to  do  next,  of  whether  or  not  to  output  the  preoutput 
of  the  OCON,  of  what  ADEs,  OCONs,  and  DDAPs  to  apply  in  the  next  cycle  of 
exploration  (special  case:  the  choice  to  have  no  next  cycle). 
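
The "leaps" in item 3 of the DDAP list can be computed directly. The Python sketch below assumes a Gaussian reference distribution and takes the typical value of the i-th order statistic to be the Gaussian quantile at (i - 1/3)/(n + 1/3); both choices, and the function name `leaps`, are assumptions for illustration, since the paper does not fix them.

```python
from statistics import NormalDist

def leaps(values):
    """Ordered leaps: gaps between adjacent order statistics, each divided by
    the gap between the corresponding typical values (Gaussian reference)."""
    xs = sorted(values)
    n = len(xs)
    std = NormalDist()
    # Assumed plotting positions for the typical values of the order statistics.
    typical = [std.inv_cdf((i - 1 / 3) / (n + 1 / 3)) for i in range(1, n + 1)]
    ratios = [(xs[i + 1] - xs[i]) / (typical[i + 1] - typical[i])
              for i in range(n - 1)]
    return sorted(ratios)
```

For a batch that follows the reference shape, the leaps are all close to the batch's scale; one or two leaps far larger than the rest point to a gap or a straggling value, which is the kind of guide to judgment asked for above.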

*  the  choice  process  * 

Today our choices are mainly matters of human choice. Tomorrow there can be
large elements of autonomy in our choices. We have to think through our DDAPs with
both human and autonomic users in mind. Human choice will often be best fed by
displays -- pictures are supposed to be worth many words, often they are worth even
more numbers. Autonomic choices may have to be fed by summaries of what would
have been displays. For the nearer future, then, autonomic choices are likely to need to
be paralleled by human lookings, looking most particularly at whatever aspects of the
display summarized for the autonomic chooser are not covered by the summaries.

Processes of CDA.

We  will  do  well  to  think  of  our  processes  of  critical/confirmatory  data  analysis  as 
following  after  a  sequence  of  EDA  cycles.  Indeed  we  can  usually  think  of  hitching  a 

CDAP -- a critical data analysis process

to an ADE-OCON pair as the typical way to do CDA. Where reasonable, we will want
to use a general CDAP.

There  are  now  a  number  of  kinds  of  general  CDA  approaches,  including: 

• differences from piece to piece, implemented with Student's t, Wilcoxon or even
biweight procedures,

•  jackknife  procedures,  usually  with  Student's  t, 

•  half-sample  procedures  (paired  or  not) 

• bootstrap procedures.

Tomorrow  there  well  may  be  more. 
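
As one illustration of a general CDAP, here is a minimal jackknife sketch in Python. It forms the usual pseudo-values, n times the full-data statistic minus (n - 1) times the leave-one-out statistic, and returns their mean, a standard error, and the degrees of freedom to use with Student's t. The function name `jackknife` and the returned dictionary are assumptions for this example; the t critical value is left to tables.

```python
from math import sqrt
from statistics import mean, stdev

def jackknife(values, statistic):
    """Jackknife estimate and standard error for any statistic of a batch."""
    values = list(values)
    n = len(values)
    full = statistic(values)
    pseudo = []
    for i in range(n):
        rest = values[:i] + values[i + 1:]          # leave value i out
        pseudo.append(n * full - (n - 1) * statistic(rest))
    return {"estimate": mean(pseudo),
            "se": stdev(pseudo) / sqrt(n),
            "df": n - 1}                            # for use with Student's t
```

The same function can be hitched to whatever number an OCON produces, since `statistic` is just a function of the batch; for the mean, the pseudo-values reduce to the original values and the result is the familiar mean with its standard error.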

*  further  example,  OCON-CDAP  * 

If we are to hitch our chosen CDAP onto an OCON, the output of our OCON
must be extensive enough. If the data are so structured as to make a factorial analysis of
variance reasonable, it will NOT, for example, suffice for the OCON to provide only the
analysis of variance table. So many have so often criticized papers that do not give the
estimated effects for the various treatments. OCONs that fail in this way fail miserably.

Indeed, we may well want our OCON to carry out the aggregations and poolings
discussed and illustrated in Green and Tukey (1960). It would then report the condensed
anova table and the effects and interactions that remain apparently relevant.


Process and reality.

We have been describing the logical steps of a data analysis. A statistical
computing system need not operate in the way these steps would naively suggest. Steps may
only be carried out when their results are needed. Results to be used twice or more may
be freely stored or equally freely forgotten and recomputed.

Understanding how to structure the calculations and rememberings may
appropriately be a quite separate process, but it will fail to give us the support we need unless it
is thought through in terms of a logical and relevant understanding of the steps of the
data analysis, some of which we have just described.

We dare not constrain implementation; we must constrain attitude and
understanding.


Some verbal schematics.

With this caveat that we are NOT trying to describe implementation, we can go
ahead with some schematic descriptions. As we do this, we will find it helpful to have
words for at least three kinds of intermediate results:

• preoutput, describing what may later be used as either output or input to another
step (mainly, here, from OCONs)

• middleput, describing extensive material intended for another step (mainly, here,
from ADEs and to DDAPs)

• diagnostics, describing material to be offered to guide choice, either human or
autonomic.

The uses of these are different enough that it seems likely that they will be implemented
differently.

The basic schematic for an ADAP -- an autonomic data analysis process -- is, then

    input -> ADE -> middleput -> OCON -> preoutput
                        |
                        v


To the arrow coming down from the middleput we would usually attach either

    autonomic choice
    human scrutiny     <- DDAP

or, temporarily,

    human choice <- DDAP

Here choice would set up the next step, specifying ADEs, OCONs, DDAPs (and
possibly Autonomic Judgers) as well as making decisions about which preoutputs,
already (logically, but not necessarily in implementation) generated, are now certain to
be output.

Notice the plurals "ADEs, OCONs, DDAPs". Any one step may involve more
than one of each kind. Several in the same step may represent either deep
understanding of what is needed or scratching around in the dark.
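
One step of the verbal schematic can be written out as a sketch in Python, with the ADEs, OCONs, DDAPs, and the chooser passed in as plain functions. Everything named here (`run_step`, the toy ADE, OCON, DDAP, and chooser, and the factor of 5 used to flag a residual) is a hypothetical illustration of the logical flow and, in keeping with the caveat above, not a description of how a system should be implemented.

```python
from statistics import mean, median

def run_step(data, ades, ocons, ddaps, choose):
    """One logical EDA step: ADEs make middleputs, OCONs make preoutputs,
    DDAPs make diagnostics, and a chooser decides what is output."""
    middleputs = {name: ade(data) for name, ade in ades.items()}
    preoutputs, diagnostics = {}, {}
    for a_name, mp in middleputs.items():
        for o_name, ocon in ocons.items():
            preoutputs[(a_name, o_name)] = ocon(mp)
        for d_name, ddap in ddaps.items():
            diagnostics[(a_name, d_name)] = ddap(mp)
    decision = choose(diagnostics)  # the chooser may be human or autonomic
    released = {k: v for k, v in preoutputs.items() if k in decision["output"]}
    return released, decision.get("next")

def simple_ade(data):
    med = median(data)
    return {"median": med, "mean": mean(data),
            "residuals": [x - med for x in data]}

def location_ocon(mp):
    return {"median": mp["median"]}

def big_residual_ddap(mp):
    scale = median(abs(r) for r in mp["residuals"])
    # The factor 5 is an arbitrary illustrative threshold.
    return [r for r in mp["residuals"] if scale and abs(r) > 5 * scale]

def cautious_chooser(diagnostics):
    clean = all(len(found) == 0 for found in diagnostics.values())
    output = {("batch", "location")} if clean else set()
    return {"output": output, "next": None if clean else "look again"}
```

Note that the preoutputs exist (logically) before the chooser runs, and the chooser only decides which of them become output and what the next step should be.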

As we get more used to alternative outputs and alternative ADAPs, we will find
ourselves more and more in need of

SDAPs -- selective data analysis procedures

in which the results of 2 or more (usually more) approaches are examined autonomically,
with the result that some (maybe more, maybe all) of these results are passed on
or outputted. Here we have almost no experience, so Colin Mallows and I are trying to
produce a good SDAP for the problem:

data structure -- a batch

objective -- shape of distribution.

Time  will  tell. 

Some pictorial schematics.

We close this discussion with some pictures of the flow of information and control
in

1)  a  single  ADAP 

2)  a  step  of  EDA 

3)  an  extended  EDA  process 

for which see exhibits 1, 2, and 3. Remember that the elements of these schematics are
logical steps and need not reflect specific implementations or specific choices of time at
which things are calculated.

Size of interaction.

At least until autonomic judgment is developed far beyond its present level, the
discussion above assumes human intervention at suitable intervals, neither too close
together nor too widely separated. I consider heretical both:

• the idea that an analyst should specify each step of his/her analysis, one after
another -- this assumes that planning parts of analyses is much easier than in truth
it is; that every user will, for instance, instinctively do the right numerical analysis.

• the idea that a package should take the data away and come back with the answers
-- this assumes that planning an overall analysis is much easier than in truth it is.

The proper spacing between human interventions will slowly grow as the years and
decades pass by, but, whatever the epoch, it will always be possible both to intervene too
frequently and to intervene too infrequently.

Keeping intervention-spacing roughly tuned to our insights and capabilities will be a
challenging, important problem throughout the foreseeable future.

Multiple answers.

We stressed the need for multiple answers in connection with robust-resistant
methods. This need existed when only classical procedures were being thought of; it will
exist in the far future, when, perchance, all the procedures we now know have been
replaced.

We need only look at multiple regression without prespecification of which carriers
(out of a specified collection) are to be used. The methods of Furnival and Wilson
[1974] make it quite feasible to learn both which subsets appear to do best and how well
they appear to do. (If we have only 10 carriers, say, the methods of Daniel and Wood
(1971, 1980) will allow us to look at all 2^10 = 1024 possibilities.) Why were users and
analysts so willing to demand multiple answers here?
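
To make the "carriers unspecified" count concrete, the Python sketch below fits every one of the 2^k subsets of k carriers by ordinary least squares and records each residual sum of squares. It is a brute-force enumeration written for illustration, not the Furnival and Wilson algorithm or the Daniel and Wood procedure; the function names are assumptions, and the plain normal-equations solver will fail on exactly collinear carriers.

```python
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of y regressed on a constant plus columns."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def all_subsets(carriers, y):
    """Map each subset of carrier names to its RSS: 2**k entries in all."""
    names = sorted(carriers)
    results = {}
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            results[subset] = rss([carriers[name] for name in subset], y)
    return results
```

With 10 carriers the returned dictionary has 1024 entries; sorting it by residual sum of squares within each subset size gives a simple view of which subsets appear to do best and how well they appear to do.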

I suggest that the same reasons will apply to wider and wider areas of analysis as we
come to recognize the nature and diversity of the possible analyses of each of many
kinds of problems. Consider multiple regression on a specified set of carriers as an
example. The development of techniques for identifying "high-leverage points" has now
been extended (Andrews and Pregibon 1978) to the identification of "high-leverage
groups", and will inevitably extend to procedures for clustering (plausibly on x's and y's
together) all the points in high-leverage entities. If there are k such clusters (some or all
may be single points) there are 2^k regressions, one obtained by setting aside each
subcollection of these clusters. I suspect that procedures for:

• telling us about all 2^k regressions, including their apparent behavior at each cluster,

• sorting out, algorithmically, those of the 2^k which seem intrinsically most likely to
interest us, AND even for

• blending together regressions for different subcollections that lead to seemingly --
but far from certainly -- different regressions

will, in due course, prove to be as useful here as results for multiple subsets have proved
to be in the carriers unspecified case.
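
The first of these procedures, reporting all 2^k regressions, is easy to sketch for a single carrier. In the Python below the clusters are assumed to be given already as lists of point indices (finding them is the hard part and is not attempted), and the names `fit_line` and `set_aside_fits` are hypothetical.

```python
from itertools import combinations

def fit_line(xs, ys):
    """Least-squares slope and intercept for one carrier."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def set_aside_fits(xs, ys, clusters):
    """One fit per subcollection of clusters set aside: 2**k fits in all.
    Keys are tuples of the cluster numbers that were set aside."""
    fits = {}
    for size in range(len(clusters) + 1):
        for aside in combinations(range(len(clusters)), size):
            dropped = {i for c in aside for i in clusters[c]}
            keep = [i for i in range(len(xs)) if i not in dropped]
            fits[aside] = fit_line([xs[i] for i in keep],
                                   [ys[i] for i in keep])
    return fits
```

Comparing the entries shows at once which set-asides change the fitted line materially; sorting out the interesting ones algorithmically, and blending fits that only seem to differ, are the harder second and third procedures and are left open here as they are in the text.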

In a word or two, I believe that "points unspecified" makes as much sense as
"carriers unspecified" and that both will always be needed. (At least until they are
subsumed into still more flexible descriptions of what is to be done.)


Actual  implementation 

The descriptions above have been wholly human-directed, emphasizing input,
choices, and output. (As I am not a system designer, it would be silly if they were not.)
We have warned that they were not intended to describe implementation, but some
examples to emphasize this are not likely to be out of place.

We have described our OCONs as chosen at the same times as our ADEs. This
does not imply that they need be implemented at the same internal time as their ADEs.
They only exist to feed other OUTs, CDAPs, or SDAPs. What is required is only this:

•  when  their  preoutputs  are  called  for,  they  will  be  returned. 

This  need  not  require  us  to  store  the  preoutputs  themselves.  Storing  any  one  of: 

•  their  preoutput 

•  middleput  and  OCON  (implicit  or  explicit),  ready  to  make  preoutputs,  OR 

•  input,  ADE,  and  OCON,  ready  to  make  preoutputs 

can  service  the  need.  Which  to  do  is  a  system  designer’s  choice. 

That  the  user  cannot  tell  directly  which  of  these  has  been  done  is  a  proper  demand, 
levied  by  the  user  community  on  the  system  designer. 
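
The three storage choices can be hidden behind one interface, so that the user cannot tell which was taken. The Python sketch below is only an illustration of that requirement: the class name `Preoutput`, the `keep` policy strings, and the toy ADE and OCON are all assumptions.

```python
from statistics import median

class Preoutput:
    """Return the same preoutput whichever intermediate result was kept."""

    def __init__(self, data, ade, ocon, keep="preoutput"):
        self._ade, self._ocon, self._keep = ade, ocon, keep
        if keep == "input":          # store input; ADE and OCON rerun on demand
            self._stored = list(data)
        elif keep == "middleput":    # store middleput; OCON reruns on demand
            self._stored = ade(data)
        elif keep == "preoutput":    # store the finished preoutput itself
            self._stored = ocon(ade(data))
        else:
            raise ValueError(f"unknown policy: {keep}")

    def get(self):
        if self._keep == "input":
            return self._ocon(self._ade(self._stored))
        if self._keep == "middleput":
            return self._ocon(self._stored)
        return self._stored

def toy_ade(data):
    med = median(data)
    return {"median": med, "residuals": [x - med for x in data]}

def toy_ocon(middleput):
    return {"median": middleput["median"]}
```

Calling `get()` gives the same answer under all three policies; only storage and recomputation cost differ, and that trade-off is the system designer's choice.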
Exhibit 3

An extended EDA process
(dashed lines show implementation of judgment)

[Figure not reproduced. Surviving label: data structure; variable names; prejudices]


